Deconfounded Visual Grounding

Authors

Abstract

We focus on the confounding bias between language and location in the visual grounding pipeline, where we find that the bias is the major visual reasoning bottleneck. For example, the grounding process is usually a trivial language-location association without visual reasoning, e.g., grounding any language query containing "sheep" to the nearly central regions, because most queries about sheep have ground-truth locations at the image center. First, we frame the visual grounding pipeline into a causal graph, which shows the causalities among image, query, target location, and an underlying confounder. Through the causal graph, we know how to break the grounding bottleneck: deconfounded visual grounding. Second, to tackle the challenge that the confounder is unobserved in general, we propose a confounder-agnostic approach called Referring Expression Deconfounder (RED) to remove the confounding bias. Third, we implement RED as a simple language attention, which can be applied in any grounding method. On popular benchmarks, RED improves various state-of-the-art grounding methods by a significant margin. Code is available at: https://github.com/JianqiangH/Deconfounded_VG.
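The abstract describes RED as "a simple language attention" over the query. The paper's actual implementation is in the linked repository; purely for intuition, here is a minimal, hypothetical sketch (all names — `language_attention`, `word_feats`, `w` — are illustrative, not from the paper) of how attending over query word features to form a single language representation might look:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def language_attention(word_feats, w):
    """Hypothetical sketch of a RED-style language attention:
    score each word feature, normalize the scores into attention
    weights, and pool the words into one query-level vector.

    word_feats: (T, d) array of per-word features for a query of T words
    w:          (d,) learnable scoring vector (here just a fixed array)
    """
    scores = word_feats @ w      # (T,) one scalar score per word
    alpha = softmax(scores)      # (T,) attention weights, sum to 1
    g = alpha @ word_feats       # (d,) attention-weighted language feature
    return g, alpha

# Toy query with 3 words and 4-dimensional word features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
w = rng.normal(size=4)
g, alpha = language_attention(feats, w)
assert abs(alpha.sum() - 1.0) < 1e-9
assert g.shape == (4,)
```

How this pooled vector is then used to cut the confounding path in the causal graph is specific to the paper's method; this sketch only illustrates the attention-pooling mechanism itself.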


Related articles

Active Grounding of Visual Situations

We address a key problem for computer vision: retrieving images that are instances of visual situations. Visual situations are concepts such as “a boxing match”, “a birthday party”, “walking the dog”, “a crowd waiting for a bus,” “a handshake”, or “a game of ping-pong,” whose instantiations in images are linked more by their common spatial and semantic structure than by low-level visual similar...

Full text

Grounding Visual Explanations (Extended Abstract)

Existing models [2] which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image. In this paper, a new model is proposed for generating explanations by utilizing localized grounding of constituent phrases in generated explanations to ensure image releva...

Full text

Grounding natural language quantifiers in visual attention

The literature on vague quantifiers in English (words like "some", "many", etc.) is replete with demonstrations of context effects. Yet little attention has been paid to the issue of where such effects come from. We explore the possibility that they emanate from a visual attentional bottleneck which limits the accuracy of judgments of number in visual scenes under conditions of time pressu...

Full text

Deconfounded Lexicon Induction for Interpretable Social Science

NLP algorithms are increasingly used in computational social science to take linguistic observations and predict outcomes like human preferences or actions. Making these social models transparent and interpretable often requires identifying features in the input that predict outcomes while also controlling for potential confounds. We formalize this need as a new task: inducing a lexicon that is...

Full text

Using Visual Information for Grounding and Awareness in Collaborative Tasks


Full text


Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2022

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v36i1.19983